72 research outputs found

    Representing interaction in multiway contingency tables: MIDOVA, CA and log-linear model

    Besides correspondence analysis (CA) and the log-linear model, both of which come from statistics, other research streams originating in Artificial Intelligence have addressed the problem of interacting variables. We present here the extension to categorical variables of our results on extracting and statistically validating "itemsets" in boolean datatables. We named our method MIDOVA (Multidimensional Interaction Differential of Variation); it highlights and represents complex links between qualitative variables, including interactions, and is well suited to socio-economic data. We compare it to the CA and log-linear approaches, using the same 3-way example as Escofier and her colleagues. We show that our method is effective for general N-way interactions (N may be far greater than 3), whether symmetric or not, and that it offers both easy and detailed interpretability, as CA does, and statistical significance testing, as the log-linear model does in the case of few variables.

    Simuler et épurer pour extraire les motifs sûrs et non redondants

    Our goal is twofold: 1) we want to mine only the statistically valid 2-itemsets out of a boolean datatable; 2) on this basis, we want to build only those higher-order itemsets that are non-redundant with respect to their sub-itemsets. For the first task we have designed a randomization test (Tournebool) that respects the structure of the data variables and is independent of the specific distributions of the data. In our test set (193 texts and 888 terms), this leads to a reduction from 400,000 2-itemsets to 4000 significant ones, at the 95% confidence level. For the second task, we have devised a hierarchical stepwise procedure (MIDOVA) for evaluating the residual amount of variation devoted to higher-order itemsets, yielding new possible positive or negative high-order relations. On our example, this leads to 2300 3-itemsets, 41 4-itemsets, and no higher-order ones, in a computationally efficient way.

    A Proposition for Fixing the Dimensionality of a Laplacian Low-rank Approximation of any Binary Data-matrix

    Laplacian low-rank approximations are widely used in graph spectral methods and Correspondence Analysis. We address here the problem of determining the dimensionality K* of the relevant eigenspace of a general binary datatable by a statistically well-founded method. We propose 1) a general framework covering graph adjacency matrices and any rectangular binary matrix, and 2) a randomization test for fixing K*. We illustrate the approach with both artificial and real data.

    Espace intrinsèque d'un graphe et recherche de communautés

    Determining the number of relevant dimensions in the eigenspace of a graph Laplacian matrix is a central issue in many spectral graph-mining applications. We tackle here the problem of finding the "right" dimensionality of Laplacian matrices, especially those often encountered in the domains of social or biological graphs: the ones underlying large, sparse, undirected and unweighted graphs, often endowed with a power-law degree distribution. We apply a randomization test to this problem. We validate our approach first on an artificial sparse, power-law type graph with two intermingled clusters, then on a real-world social graph ("Football-league"), where the intrinsic dimension appears to be 11; we illustrate the optimality of this transformed dataspace both visually and numerically, by means of a density-based clustering technique and a decision tree.

    A Randomization Test for extracting Robust Association Rules

    An association rule "if A then B" is a link between database property sets A and B. Since this type of rule is not deduced from hypotheses but found by investigating data, association rule extraction belongs to Data Mining techniques (Han et al. 2001). At present, more than fifty different measures are used to try to establish the quality of association rules, according to their different semantics. This shows the great variety of links between properties expressed by these rules, but also the difficulty of being sure they are meaningful. To test whether an association rule is robust, that is, to determine whether the link it brings out is not due to chance, a Randomization Test (Edgington, 1995) is developed. For this, we define simulations that generate numerous artificial databases identical to the original database except for the links between properties. Only the links which are found in the original database and in less than 5% of the artificial databases are judged statistically significant, with a type I error risk of less than 5% (Snedecor et al., 1967), and produce significant association rules. This simulation technique is far more efficient than the acceptance-rejection method and allows the associated randomization test to be used on various databases.
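
The randomization test described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say how the artificial databases are generated, so the sketch assumes a "checkerboard" swap scheme that keeps every row sum and column sum of the boolean table unchanged. The function names (`support`, `swap_randomize`, `fraction_reaching_support`) and the choice of ten swap attempts per 1-entry are hypothetical.

```python
import random

def support(table, a, b):
    """Number of rows in which columns a and b are both 1."""
    return sum(1 for row in table if row[a] and row[b])

def swap_randomize(table, n_swaps, rng):
    """Return a shuffled copy of a 0/1 table (assumed simulation scheme).

    Each accepted swap turns a 2x2 submatrix [[1, 0], [0, 1]] into
    [[0, 1], [1, 0]], so all row sums and column sums are preserved.
    """
    t = [row[:] for row in table]
    n_rows, n_cols = len(t), len(t[0])
    for _ in range(n_swaps):
        r1, r2 = rng.randrange(n_rows), rng.randrange(n_rows)
        c1, c2 = rng.randrange(n_cols), rng.randrange(n_cols)
        if t[r1][c1] == 1 and t[r2][c2] == 1 and t[r1][c2] == 0 and t[r2][c1] == 0:
            t[r1][c1] = t[r2][c2] = 0
            t[r1][c2] = t[r2][c1] = 1
    return t

def fraction_reaching_support(table, a, b, n_sim=200, seed=0):
    """Fraction of simulated tables whose support for (a, b) is at least
    the support observed in the original table."""
    rng = random.Random(seed)
    n_swaps = 10 * sum(sum(row) for row in table)  # arbitrary mixing budget
    observed = support(table, a, b)
    hits = sum(
        1 for _ in range(n_sim)
        if support(swap_randomize(table, n_swaps, rng), a, b) >= observed
    )
    return hits / n_sim
```

Under these assumptions, a link between columns `a` and `b` would be kept when `fraction_reaching_support(table, a, b)` falls below 0.05, mirroring the 5% threshold stated in the abstract.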

    Indexer, comparer, apparier des textes et leurs résumés : une exploration.

    We present here the approach that earned us a 100% success score, and joint first place, in the DEFT 2011 challenge task of matching abstracts with articles stripped of their introduction and conclusion. We tested several types of indexing and of abstract-to-text distance, and developed a closed-world matching method that is robust and requires no external information. By combining four variants of the compression distance, which is independent of language and of character encoding, the method reaches 93%; 100% is reached with the Hellinger distance applied to texts indexed by lemmatized nouns and compound terms, a distance that here outperforms the classical TF-IDF. We suggest applying it in an open-world setting, with more texts than abstracts, and with abstracts that have no matching text.
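
As a rough illustration of the Hellinger-distance matching mentioned in the abstract above, the sketch below compares term-frequency distributions and assigns each abstract to its nearest text. It rests on several assumptions that are not in the source: whitespace tokenization stands in for the indexing by lemmatized nouns and compound terms, a plain nearest-neighbour rule stands in for the authors' closed-world matching method, and the names `hellinger` and `match` are hypothetical.

```python
import math
from collections import Counter

def hellinger(p_counts, q_counts):
    """Hellinger distance between two term-count mappings.

    Counts are normalized to probability distributions first; the result
    lies in [0, 1], with 0 for identical distributions and 1 for
    distributions that share no term.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    terms = set(p_counts) | set(q_counts)
    squared = sum(
        (math.sqrt(p_counts.get(t, 0) / p_total)
         - math.sqrt(q_counts.get(t, 0) / q_total)) ** 2
        for t in terms
    )
    return math.sqrt(squared / 2)

def match(abstracts, texts):
    """Map each abstract id to the id of the closest text.

    Tokenization by lower-casing and splitting on whitespace is a
    placeholder assumption, not the indexing used in the paper.
    """
    text_counts = {tid: Counter(body.lower().split()) for tid, body in texts.items()}
    result = {}
    for aid, body in abstracts.items():
        a_counts = Counter(body.lower().split())
        result[aid] = min(text_counts, key=lambda tid: hellinger(a_counts, text_counts[tid]))
    return result
```

Note that this nearest-neighbour rule does not force a one-to-one assignment; a closed-world setting with as many abstracts as texts could add that constraint.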

    Recoder les variables pour obtenir un modèle implicatif optimal

    A number of methods are available for deriving a categorization model of type XY out of a set of individual data, where X is a set of individual numerical features and Y their categories. We give a brief overview of these methods by applying the most popular ones to the well-known "Fisher's Iris" dataset. The comparison of the resulting models encourages us to give preference to ISA (Implicative Statistical Analysis) for this specific type of data, provided the quantitative variables are thoroughly recoded. This paper incorporates and expands a communication made during the A.S.I.8 conference (Cadot et al. 2015), in which we showed the interest of the chosen methodology (ISA after a specific recoding step) for the processing of acoustic data.

    Random simulations of a datatable for efficiently mining reliable and non-redundant itemsets

    Our goal is twofold: 1) we want to mine only the statistically valid 2-itemsets out of a boolean datatable; 2) on this basis, we want to build only those higher-order itemsets that are non-redundant with respect to their sub-itemsets. For the first task we have designed a randomization test (Tournebool) that respects the structure of the data variables and is independent of the specific distributions of the data. In our test set (959 texts and 8477 terms), this leads to a reduction from 126,000 2-itemsets to 13,000 significant ones, at the 99% confidence level. For the second task, we have devised a hierarchical stepwise procedure (MIDOVA) for evaluating the residual amount of variation devoted to higher-order itemsets, yielding new possible positive or negative high-order relations. On our example, this leads to counts ranging from 7,712 for 2-itemsets to 3 for 6-itemsets, and no higher-order ones, in a computationally efficient way.